Finding Approximate Palindromes in Strings Quickly and Simply. 

Abstract: Described are two algorithms to find long
approximate palindromes in a string, for example a
DNA sequence. A simple algorithm requires O(n)-space
and almost always runs in O(k.n)-time where
n is the length of the string and k is the number
of "errors" allowed in the palindrome. Its worst-case
time-complexity is O(n^2) but this does not occur
with real biological sequences. A more complex
algorithm guarantees O(k.n) worst-case time complexity.

Code of the simple algorithm will be placed at
http://www.csse.monash.edu.au/~lloyd/tildeProgLang/Java2/Palindromes/

1 Introduction 

An (exact) palindrome, p, is a string of symbols 
that reads the same forwards and backwards, i.e. ei- 
ther p = w.w' or p = w.c.w' where w is a string, 
c is a symbol and w' = reverse(w); for com- 
plementary palindromes in DNA (RNA) we have 
w' = reverse(complement(w)) where A and T (U) 
are complementary, as are C and G. The first case, 
p = w.w' , is called an even-palindrome and the sec- 
ond, p = w.c.w' , is called an odd-palindrome and 
either the "gap" between w and w' or the symbol 
c is called the centre of the palindrome. Finding 
palindromes within a long string leads to various 
classic computing problems, e.g. the longest palindrome
within a string can be found in linear time by
using a suffix-tree (Weiner 1973, McCreight 1976).
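
The two definitions above can be checked directly. The following is a minimal sketch, not the paper's code: the class and method names are hypothetical, the complement table covers DNA only (A/T, C/G), and, following the definition p = w.c.w', the centre symbol of an odd-length complementary palindrome is left unconstrained.

```java
final class ExactPalindromes {
    // DNA complement: A<->T, C<->G. Other symbols are returned unchanged
    // (an assumption made only to keep the sketch total).
    static char complement(char c) {
        switch (c) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return c;
        }
    }

    // True if p = w.w' or p = w.c.w' with w' = reverse(w).
    static boolean isPalindrome(String p) {
        for (int i = 0, j = p.length() - 1; i < j; i++, j--) {
            if (p.charAt(i) != p.charAt(j)) return false;
        }
        return true;
    }

    // True if p = w.w' or p = w.c.w' with w' = reverse(complement(w)).
    static boolean isComplementaryPalindrome(String p) {
        for (int i = 0, j = p.length() - 1; i < j; i++, j--) {
            if (complement(p.charAt(i)) != p.charAt(j)) return false;
        }
        return true;
    }
}
```

For instance, ACGGCA is an even-palindrome and ACGCA an odd-palindrome, while GAATTC is a complementary palindrome but not an ordinary one.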

Palindromes can be interesting biologically 
(e.g. Tsunoda 1999, Rozen 2003) and reverse com- 
plementary palindromes are relevant to hair-pin 
loops in RNA folding. But, in biology, palin- 
dromes are often allowed to be approximate: k "er- 
rors" or "differences" are allowed between w and 
reverse(w'), that is w and reverse(w') can have 
an edit-distance of up to k. Note that in general one
approximate palindromic string, p, may correspond to
multiple decompositions p = w.w' or p = w.c.w' (it
is not necessary that |w| = |w'|), and a decomposition
may correspond to multiple alignments of w
and reverse(w'); we prefer a cheapest decomposition
and alignment and require cost ≤ k.

Porto and Barbosa (2002) gave an O(k^2 n)-time
algorithm to find long approximate palindromes in
a string. This paper gives a simple algorithm to find
long approximate palindromes. It runs in O(n)-space
and, almost always, in O(k.n)-time; e.g. for
k~10, a million bases of real DNA can be processed
in a few seconds on a p.c., most of that time being
for I/O. A more complex algorithm guarantees
O(k.n) running time.

[[ fig 1 near here ]] 

2 Algorithm

It is convenient to describe the simple algorithm in 
terms of a distance matrix, related to those used in 
some alignment algorithms. The matrix is a variation
on a triangular matrix (Figure 1). An odd
exact palindrome is centered on one of the cells 
marked 'O', and an even exact palindrome on one of 
the cells marked 'E'. A marked cell is called an ori- 
gin. Diagonals that run from an origin in a NE di- 
rection are important; note that an odd exact palin- 
drome corresponds to an even-numbered diagonal 
and an even exact-palindrome to an odd-numbered 
diagonal. Also important are distances along diagonals
(Figure 2). It must be pointed out that the
algorithm does not directly use a distance matrix; 
rather it operates on a different but equivalent ma- 
trix to be described. 
[[ fig 2 near here ]] 

An approximate palindrome, p, together with an 
alignment of w and reverse(w') where p = w.w'
or p = w.c.w', implying a cost, is equivalent to a
path (Figure 3) which extends step by step N, E,
and/or NE, some distance from an origin. A NE 
step represents a match or a mismatch. N and E 
steps represent indels. Each cell of the (notional) 
distance matrix holds the minimum cost of some 
optimal path from some origin, not necessarily on 
the same diagonal, to the cell. The position of any 
cell in the distance matrix specifies an approximate 
palindrome, p itself, without any associated align- 
ment; the position fixes the start and the end of 
the string p. Obviously we want the minimum-cost 
for an approximate palindrome. 
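
The notional distance matrix can be written down as a quadratic-time dynamic program, which is useful as a reference when testing faster code. This is a sketch under assumptions, not the paper's method: the names are hypothetical, cost[i][j] refers to the substring s[i..j) rather than to the triangular layout of Figure 1, and each mismatch or indel is assumed to cost 1. Empty and single-symbol substrings play the role of the even and odd origins and cost 0.

```java
final class PalindromeDistanceMatrix {
    // cost[i][j] = minimum number of errors needed to read s.substring(i, j)
    // as an approximate palindrome (0 <= i <= j <= s.length()).
    static int[][] costs(String s) {
        int n = s.length();
        int[][] cost = new int[n + 1][n + 1]; // lengths 0 and 1 stay at cost 0
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len <= n; i++) {
                int j = i + len;
                // NE step: pair the two end symbols (match or mismatch).
                int pair = cost[i + 1][j - 1]
                         + (s.charAt(i) == s.charAt(j - 1) ? 0 : 1);
                // N and E steps: one end symbol is an indel.
                int dropLeft = cost[i + 1][j] + 1;
                int dropRight = cost[i][j - 1] + 1;
                cost[i][j] = Math.min(pair, Math.min(dropLeft, dropRight));
            }
        }
        return cost;
    }

    // Length of the longest substring whose cost is at most k.
    static int longestWithin(String s, int k) {
        int[][] cost = costs(s);
        int n = s.length();
        for (int len = n; len > 0; len--) {
            for (int i = 0; i + len <= n; i++) {
                if (cost[i][i + len] <= k) return len;
            }
        }
        return 0;
    }
}
```

For example, ACGTTGA as a whole has cost 1: A pairs with A, C is an indel, and GTTG is an exact palindrome.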

[[ fig 3 near here ]] 

The algorithm actually uses a different but
equivalent matrix, reach[d][e], indexed by d which
corresponds to a diagonal-number in the distance
matrix, and by "error" count, e, where 0 ≤ e ≤ k.
reach[d][e] holds the maximum distance along diag- 
onal d of the distance matrix that can be reached by 
an approximate palindrome for a cost of at most e. 

The algorithm initially finds exact palindromes, 
e = 0, i.e. paths that move NE only, as long as this
can be done for a cost of zero. It then iterates over
the number of errors allowed, e = 1..k, and, within
that, over diagonal-number, d, where it executes the
general step:

   reach[d][e] =
      max(reach[d-1][e-1]+x,
          reach[d  ][e-1]+1,
          reach[d+1][e-1]+x),   where x = d & 1;

   while endsMatch(d, reach[d][e]) do
      reach[d][e]++   // extend for free

It is an instance of a greedy strategy (e.g. Ukko- 
nen 1983). Other tests, not shown, check that the 
ends of the string are not overrun. On termination,
reach[d][k] holds the maximum NE-erly distance
from an origin of an acceptable path ending
on diagonal d, thus giving long approximate palin- 
dromes. 
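
The general step can be sketched in Java as follows. This is an illustration under stated assumptions, not the author's released code. The numbering is hypothetical: diagonal d runs over 0..2n-2, an even d is centred on symbol d/2, an odd d on the gap after symbol (d-1)/2, and a distance r on diagonal d denotes the substring from index (d+1)/2 - r to index d/2 + r inclusive. With this convention, crossing from a neighbouring diagonal adds x = d & 1 to the distance, as in the general step. Only rows e-1 and e are kept, and candidates are clamped so that the ends of the string are not overrun.

```java
final class ApproxPalindromeSketch {
    // Largest distance on diagonal d that keeps both ends inside the string.
    static int maxReach(int n, int d) {
        return Math.min((d + 1) / 2, n - 1 - d / 2);
    }

    // Extend for free: step NE while the next outer pair of symbols matches.
    static int extend(String s, int d, int r) {
        int i = (d + 1) / 2 - r - 1; // next symbol on the left
        int j = d / 2 + r + 1;       // next symbol on the right
        while (i >= 0 && j < s.length() && s.charAt(i) == s.charAt(j)) {
            r++; i--; j++;
        }
        return r;
    }

    // Returns reach[d][k] for every diagonal d, using two rows of length 2n-1.
    static int[] reach(String s, int k) {
        int n = s.length();
        int diagonals = Math.max(0, 2 * n - 1);
        int[] prev = new int[diagonals];
        for (int d = 0; d < diagonals; d++) {
            prev[d] = extend(s, d, 0);           // e = 0: exact palindromes
        }
        for (int e = 1; e <= k; e++) {
            int[] cur = new int[diagonals];
            for (int d = 0; d < diagonals; d++) {
                int x = d & 1;
                int r = prev[d] + 1;             // mismatch on the same diagonal
                if (d > 0) r = Math.max(r, prev[d - 1] + x);             // indel
                if (d + 1 < diagonals) r = Math.max(r, prev[d + 1] + x); // indel
                r = Math.min(r, maxReach(n, d)); // do not overrun the string
                cur[d] = extend(s, d, r);
            }
            prev = cur;
        }
        return prev;
    }

    // Length of the longest approximate palindrome with at most k errors.
    static int longestLength(String s, int k) {
        int[] last = reach(s, k);
        int best = 0;
        for (int d = 0; d < last.length; d++) {
            best = Math.max(best, 2 * last[d] + 1 - (d & 1));
        }
        return best;
    }
}
```

On ACGTTGA, for example, the exact phase finds GTTG (length 4); with k = 1 an indel of C moves the path to the neighbouring diagonal, after which the outer A's match for free and the whole string of length 7 is reported.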

O(n)-space is sufficient to find the approximate
palindromes because reach[ ][e] only depends on
reach[ ][e-1]. If alignments (paths) are also required,
either O(k.n)-space is required to keep all
of reach[ ][ ] or, probably more sensibly assuming
path lengths << n, paths can be recovered later by
a separate process.

The simple algorithm's worst-case behaviour,
O(n^2)-time, is for strings such as A^n, (AT)^(n/2), and
similar. The cause is looping in order to check a 
run of matches to extend a path directly NE for 
zero cost; in practice the average run ends quickly 
on real DNA sequences. The complex algorithm 
is, in principle, formed by replacing the simple al- 
gorithm's loop by a constant-time step (following 
linear-time preprocessing) which uses a suffix-tree 
and a least-common-ancestor (LCA) algorithm such 
as that of Bender and Farach-Colton (2000). 
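
The quadratic behaviour is easy to observe by counting symbol comparisons in the exact (e = 0) phase alone. The sketch below uses the same hypothetical diagonal numbering as an illustration only (even d centred on symbol d/2, odd d on the gap after symbol (d-1)/2); the method name is an assumption. On A^n every pair of positions is compared exactly once, giving n(n-1)/2 comparisons, whereas on a string with no repeats nearby each diagonal stops after one comparison.

```java
final class ComparisonCount {
    // Counts the symbol comparisons made while extending every diagonal
    // for free from its origin (the e = 0 phase of the simple algorithm).
    static long exactPhaseComparisons(String s) {
        int n = s.length();
        long comparisons = 0;
        for (int d = 0; d <= 2 * n - 2; d++) {
            int i = (d + 1) / 2 - 1; // first symbol to the left of the origin
            int j = d / 2 + 1;       // first symbol to the right of the origin
            while (i >= 0 && j < n) {
                comparisons++;
                if (s.charAt(i) != s.charAt(j)) break; // run of matches ends
                i--; j++;
            }
        }
        return comparisons;
    }
}
```

For n = 100, A^n costs 4950 comparisons in this phase, while ACGTACGT (n = 8) costs only 13, i.e. 2n - 3.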

3 Results 

The simple algorithm was coded in Java and tested 
on a Linux p.c., AMD Athlon XP™ 2400+ processor,
512MB of memory. It confirmed O(k.n)-time
complexity in practice on real DNA, e.g. processing
chromosome 3 (1.06Mb) of the malaria organism 
Plasmodium falciparum (Gardner et al 2002) as fol- 
lows: k = 10 in 8.0s, k = 20 in 10.2s, k = 40 
in 14.3s. Such DNA is approximately 80% AT-rich 
and is the kind of real DNA most likely to cause 
problems for this kind of algorithm if any will. The 
algorithm has not been observed to make more than 
3.7(k + 1)n symbol comparisons on real DNA sequences.

Figure 1: Odd and Even Origins

Figure 2: Some Distances along Diagonals

Figure 3: Example Paths